fix: extract Docling async markdown result #3031

Open
he-yufeng wants to merge 2 commits into HKUDS:dev from he-yufeng:fix/docling-markdown-extraction
Conversation

@he-yufeng
Contributor

Summary

  • align the Docling async parser defaults with docling-serve v1: port 5001 examples, files upload field, task_status polling, and /v1/status/poll/{task_id}
  • add a result URL template fallback for services that return only task status while exposing results at /v1/result/{task_id}
  • extract document.md_content from the Docling result JSON instead of returning the whole response envelope as raw document text
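
As a rough illustration of the third point, the extraction step could look like the sketch below. It is not the PR's actual code: `get_by_path` here is a hypothetical stand-in for the helper of the same name in lightrag/pipeline.py (whose implementation is not shown in this thread), `extract_markdown` is an invented name, and the `status` key in the sample envelope is an assumption about the response shape.

```python
import json
from typing import Any, Optional


def get_by_path(payload: Any, path: str) -> Any:
    """Walk a dotted path such as "document.md_content" through nested dicts."""
    current = payload
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def extract_markdown(
    response_text: str, content_field: str = "document.md_content"
) -> Optional[str]:
    """Return the markdown string from a Docling-style JSON result, or None."""
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError:
        return None  # not JSON at all
    value = get_by_path(payload, content_field)
    return value if isinstance(value, str) else None


# Assumed envelope shape: only document.md_content comes from the PR summary.
envelope = json.dumps(
    {"document": {"md_content": "# Title\n\nBody"}, "status": "success"}
)
print(extract_markdown(envelope))
```

The point of the dotted path is that the caller receives only the markdown body, not the surrounding response envelope.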

Verified locally

  • python -m pytest tests\test_pipeline_release_closure.py -q -k "docling or mineru_empty_service_result"
  • python -m ruff check lightrag\pipeline.py tests\test_pipeline_release_closure.py
  • python -m ruff format --check lightrag\pipeline.py
  • python -m py_compile lightrag\pipeline.py tests\test_pipeline_release_closure.py
  • git diff --check

Note: running the full tests/test_pipeline_release_closure.py file on Windows still hits two existing path-separator assertions unrelated to this change.

Closes #2996

@he-yufeng force-pushed the fix/docling-markdown-extraction branch from 3c02f33 to 606ed09 on May 8, 2026 15:51
Contributor Author

Rebased this onto the latest dev after the pipeline refactor and resolved the lightrag/pipeline.py conflict. The Docling async defaults are now applied in the current parse_docling protocol config rather than the old pre-refactor location.

Validation on the rebased head 606ed098:

  • python -m uv run pytest tests\test_pipeline_release_closure.py::test_parse_docling_uses_docling_serve_async_defaults tests\test_pipeline_release_closure.py::test_protocol_parse_service_extracts_docling_result_markdown tests\test_pipeline_release_closure.py::test_parse_docling_empty_service_result_raises_without_fallback tests\test_pipeline_release_closure.py::test_parse_mineru_empty_service_result_raises_without_fallback -q
  • python -m uv run ruff check lightrag\pipeline.py tests\test_pipeline_release_closure.py
  • python -m py_compile lightrag\pipeline.py tests\test_pipeline_release_closure.py
  • git diff --check upstream/dev..HEAD

@danielaskdd
Collaborator

@codex review

@danielaskdd added the tracked label (Issue is tracked by project) on May 8, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 606ed098cd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread lightrag/pipeline.py
@he-yufeng force-pushed the fix/docling-markdown-extraction branch 2 times, most recently from 436945c to 6a69155 on May 8, 2026 22:20
@danielaskdd
Collaborator

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6a69155b8e

Comment thread docs/FileProcessingConfiguration-zh.md
@danielaskdd
Collaborator

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 27a31e4400

Comment thread lightrag/pipeline.py
    except Exception:
        return text
    content_val = get_by_path(payload, content_field)
    return _string_content(content_val) or text
P2: Reject empty Docling content instead of indexing envelope

When Docling returns a JSON result whose configured content field is present but empty (for example a blank document, a skipped conversion, or a conversion result with document.md_content: ""), this fallback returns the entire JSON envelope as document text. That bypasses the existing "Docling parser returned empty content" guard in parse_docling and can index status, errors, and metadata instead of extracted markdown. Distinguish "JSON parsed but content is empty or missing" from "non-JSON raw text" here rather than falling back to text.
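
One way to make that distinction is sketched below, purely for illustration. The names `EmptyContentError`, `lookup`, and `extract_content` are hypothetical and not taken from lightrag/pipeline.py. Raising here is only one option: the real code might instead return an empty string so that the existing guard in parse_docling fires.

```python
import json
from typing import Any


class EmptyContentError(ValueError):
    """Hypothetical error: JSON parsed, but the content field is empty or missing."""


def lookup(payload: Any, path: str) -> Any:
    """Resolve a dotted path through nested dicts, returning None when absent."""
    current = payload
    for key in path.split("."):
        if not isinstance(current, dict):
            return None
        current = current.get(key)
    return current


def extract_content(text: str, content_field: str) -> str:
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return text  # case 1: non-JSON raw text passes through unchanged
    if not isinstance(payload, dict):
        # Valid JSON that is not an object (e.g. "123") is treated as raw
        # text in this sketch; that choice is an assumption.
        return text
    content = lookup(payload, content_field)
    if not isinstance(content, str) or not content.strip():
        # case 2: JSON envelope without usable content; never index the envelope
        raise EmptyContentError(f"no content at {content_field!r}")
    return content  # case 3: extracted markdown


print(extract_content("plain text result", "document.md_content"))
print(extract_content('{"document": {"md_content": "# Hi"}}', "document.md_content"))
```

With this split, an envelope such as `{"document": {"md_content": ""}}` raises instead of being returned as document text.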

Comment thread env.example
# DOCLING_CONTENT_FIELD=document.md_content
# DOCLING_FILE_FIELD=files
# DOCLING_SUCCESS_VALUES=done,success,completed
# DOCLING_FAILED_VALUES=failed,error,cancelled
P3: Include Docling's failure status in the sample override

If a user uncomments this Docling block, DOCLING_FAILED_VALUES overrides the code default, which now includes "failure". Docling v1 reports failed async tasks as task_status: "failure", so with this sample configuration a failed conversion is treated as still pending until DOCLING_MAX_POLLS expires instead of raising the parse-service failure promptly. Add "failure" to the example list.
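
The effect can be reproduced with a small sketch of how a poll loop might classify task_status values. `status_values` and `classify` are hypothetical helpers, not the pipeline's real functions. The code-default list used here is also an assumption: the review only states that the default now includes failure.

```python
from typing import Mapping, Set


def status_values(env: Mapping[str, str], name: str, default: str) -> Set[str]:
    """Parse a comma-separated status list, falling back to the default."""
    raw = env.get(name, default)
    return {item.strip().lower() for item in raw.split(",") if item.strip()}


def classify(task_status: str, success: Set[str], failed: Set[str]) -> str:
    """Map a reported task_status to success, failed, or pending."""
    status = task_status.strip().lower()
    if status in success:
        return "success"
    if status in failed:
        return "failed"
    return "pending"  # unknown statuses keep the poll loop running


# Assumed code default; only the presence of "failure" comes from the review.
ASSUMED_DEFAULT_FAILED = "failed,failure,error,cancelled"

success = status_values({}, "DOCLING_SUCCESS_VALUES", "done,success,completed")
sample_failed = status_values(
    {"DOCLING_FAILED_VALUES": "failed,error,cancelled"},  # env.example as written
    "DOCLING_FAILED_VALUES",
    ASSUMED_DEFAULT_FAILED,
)
fixed_failed = status_values(
    {"DOCLING_FAILED_VALUES": "failed,failure,error,cancelled"},
    "DOCLING_FAILED_VALUES",
    ASSUMED_DEFAULT_FAILED,
)

print(classify("failure", success, sample_failed))  # pending
print(classify("failure", success, fixed_failed))   # failed
```

Because the override replaces the default list rather than extending it, the sample value alone decides whether "failure" ends polling.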

@danielaskdd
Collaborator

The FileProcessingConfiguration-zh.md has been restructured. Please merge the latest dev branch and update the Docling usage instructions within the new document architecture.

@he-yufeng force-pushed the fix/docling-markdown-extraction branch from 27a31e4 to b057844 on May 9, 2026 16:58
@he-yufeng
Contributor Author

Rebased onto the latest dev branch and updated the Chinese file-processing doc in the new structure. The Docling quick-start endpoint now matches the current 5001 examples, and the obsolete duplicated full_docs section from the old layout is gone.

Validation:

  • python -m py_compile lightrag\pipeline.py tests\test_pipeline_release_closure.py
  • python -m uv run ruff check lightrag\pipeline.py tests\test_pipeline_release_closure.py
  • python -m uv run pytest tests\test_pipeline_release_closure.py::test_parse_docling_uses_docling_serve_async_defaults tests\test_pipeline_release_closure.py::test_protocol_parse_service_extracts_docling_result_markdown tests\test_pipeline_release_closure.py::test_protocol_parse_service_raises_on_docling_failure_status tests\test_pipeline_release_closure.py::test_parse_docling_empty_service_result_raises_without_fallback tests\test_pipeline_release_closure.py::test_parse_mineru_empty_service_result_raises_without_fallback -q (5 passed)

Labels

tracked Issue is tracked by project
